This project was completed for the Urban Spatial Analytics/Smart Cities Practicum course (MUSA 790), instructed by Michael Fichman and Matthew Harris. We are grateful to our instructors for their continued support and feedback. We are also grateful to our client Daniel Sebastian Padilla Ochoa from Ocean Conservancy for his guidance and support throughout this practicum project.

Introduction

Abstract

Marine litter continues to pose as a major threat to people, animals, and the environment. While zero-waste pilot programs like Urban Ocean have made strides in reducing marine litter, there persists difficulty in site-selection of pilots due to lacking litter information. This adds unnecessary barriers to the siting of pilot zero-waste pickup programs. By leveraging geospatial machine learning, this project aims to explore how best to navigate zero-waste siting in areas facing bleak marine litter conditions.

Client Background

Image Source: Ocean Conservancy Image Source: The Independent

Image Source: Ocean Conservancy, The Independent

Urban Ocean aims to prevent ocean plastic pollution by supporting projects in cities address plastic pollution and infrastructure resilience. This includes reducing the amount of single-use plastic produced, improving waste management systems, promoting circular economy principles, and building awareness. Urban Ocean is an accelerator project created out of a trilateral initiative/partnership between Circulate Initiative, Resilient Cities Network, and Ocean Conservancy:

  • Resilient Cities Network is a network of cities bringing together global knowledge, practice, partnerships, and funding to empower its members through building equity and resilience. The organization supports on-the-ground projects and solutions to facilitate connections and information-sharing between communities and local leaders.
  • Ocean Conservancy is a non-profit environmental advocacy group working to advance marine health and protection. Ocean Conservancy focuses on science-based solutions for a healthy ocean and the wildlife and communities that depend on it.
  • The Circulate Initiative is a non-profit organization that works to solve the plastic pollution challenge and build circular and equitable economies across emerging markets. The organization focuses on research, high-impact programs, and stakeholder collaboration to forward sustainability and equitable growth.

This program takes multiple approaches to address ocean plastic pollution. First, they mobilize cleanup campaigns across each city based on volunteer efforts with international and local support groups, such as the International Coastal Cleanup. Second, they research each city’s sanitation and urban systems to identify key details regarding what causes trash to end up in oceans and waterways, sharing findings and recommending action plans to local communities and decision makers and working with stakeholders directly to create changes in products, practices, and behaviors as alternative solutions to their current needs. With these efforts, each campaign requires an evaluation of local context and detailed understanding of the city’s systems and dynamic. This aspect of the process requires long-term planning and methodical decision-making to understand the pre-existing conditions in place. Hence, our model aims to streamline the process of understanding the general structure of each city-context and provide a comprehensive solution on where efforts and resources should be allocated to.

We based our focus and model on the current cities from the Urban Ocean program. The 12 participating cities a part of the Urban Ocean program as of May 2024 are:

  1. Can Tho, Vietnam
  2. Chennai, India
  3. Melaka, Malaysia
  4. Mumbai, India
  5. Panama City, Panama
  6. Pune, India
  7. Santiago, Chile
  8. Semarang, Indonesia
  9. Surat, India
  10. Bangkok, Thailand
  11. Santa Fe, Argentina
  12. Salvador, Brazil

Motivation and Use Case

As marine litter worsens across the world, litter accumulation poses significant cyclical health, biodiversity, and climate risks. For example, an estimated 11 million metric tons of plastic waste enter the ocean every year, and this number expected to triple by 2040. Particularly in developing countries with lacking waste disposal systems and recycling programs, litter accumulation poses a much steeper issue that is often difficult to solve systemically. This often disproportionately impacts vulnerable coastal dwellers and fisherman who heavily rely on the marine ecosystem. The trilateral partnership program, Urban Ocean, has been developing initiatives that mitigate marine pollution, assess waste management, and enable cities to address ocean plastics and resilience. These projects intend to deploy “zero-waste” pilot strategies in cities around the world. With this wide-spanning outreach and diversity in partnerships, an assessment on site selection and resource allocation becomes a repeated step in the process that hinders efficiency each waste-reduction campaign.

Our objective is to develop a site selection and assessment tool that identifies effective zero-waste site locations for national and multinational use. This geospatial risk assessment model serves the purpose of predicting litter accumulation based on globally sourced data with a repeatable framework on an international scale. The results of the model aim to evaluate which sections of a given area have the highest likelihood to produce and contain litter relative to its surroundings. For our case, the Urban Ocean program intends to dedicate “zero-waste” solutions in these areas for the most effective impact for each deployment.

Our resulting dashboard is to support the narrative-building and decision-making of zero-waste pilots and attract funding sources and other support for these cities.

Exploratory Data Analysis

Data Sources

Our project involved a range of data sources for model building and finding key insights into the current state of marine litter. The exploratory data analysis primarily focused on extracting information from global sources to ensure scalability of the model scope and application.

Marine Debris Tracker

Our primary dataset was existing marine debris/litter from the Marine Debris Tracker powered by Morgan Stanley in collaboration with the National Geographic Society and the University of Georgia. The Marine Debris Tracker is a web application and public use database that tracks user-recorded global litter pick-up projects and initiatives from participating organizations. The marine litter data scope selected for this analysis was between January 2021 - February 2024. This period was selected because it was the largest period of data collection time post-COVID-19.

OpenStreetMap Data

OpenStreetMap (OSM) data was used to provide additional context to the litter data. OSM featured land-use, proximity to water, waste site facilities, roads, and other indicators of commercial activity. The aim of these variables was to act as proxies for litter cases, providing insight on where litter most likely accumulated or ends up. Each variable was then placed on the fishnet grid with the litter data, computed using a Chi-Squared test, and evaluated for association between the variables and litter.

Other Data

For our model, we directed our focus on data modeling from the Urban Ocean cities. To develop our model for each respective city (or future use), each area was determined by a ‘boundary box’ that was created by a custom KML file using Google My Maps or sourced online. Each boundary’s drawing was dictated by political boundaries and practical scope of where data was recorded within the city, further methodology of each individual city is detailed on the interactive dashboard.

Additionally, we also hypothesized population density as having a major influence on our model. We used Data for Good at Meta’s global high resolution population density maps. This global dataset works as a proxy for population density.

Variables

The litter data features a wide variety of item categories and general information about each piece. Despite the quality of detail for each recorded sample, this only accounts for litter that has been identified, recorded, and disposed of. This data does not account for litter that was identified but never disposed of, accounted for, or assumed. Regardless, areas where litter was not recorded in a general surrounding could not confidently be assumed to be present or not present. Hence, the purpose of the model is to predict where litter is most likely to accumulate within the city based on the data available.

Figure 2 shown below depicts the dependent variables used for model development. Each serve their own purpose in characterizing the city that is evaluated. Social & economic activities are depicted using restaurants and population density. Restaurants are an indication of commercial corridors and general ‘downtown’ aesthetic that attracts large populations, typically located in central hubs of urban activity. Their presence, especially when in clusters, can indicate a higher likelihood of commercial hubs and in turn program the model to classify these areas as major activity. Another metric for characterizing a city landscape is population density. Regardless of population itself, density is a key factor in determining where people are located in relation to the rest of the city. This metric supports our geospatial assessment in determining relative risk of litter accumulation, rather than independent probability.

Geomorphology describes the city’s physical geography for the model. Trash, when moved around by nature, weather, human activity, can a higher probability of moving towards lower elevations and bodies of water. Using water distribution as a factor accomplishes two pieces of information in one variable. Naturally, water distribution displays where water is located. When trash makes contact with a body of water, it rarely exists, requiring cleanup intervention to remove it, acting as an indication of higher risk. Secondly, it acts as a general but not definite idea of terrain. Proximity to water does not fully account or indicate for elevation, but it does provide a general idea of a path of travel for loose trash naturally moving.

Waste management or waste facilities serve the purpose off collecting, storing, transporting, recycling, and disposing of waste in an area to prevent or minimize adverse effects on the environment and public health of an area. These facilities act as a factor of litter risk because of their location in relation to the city center, where litter has been located, and their capabilities to cover their assign jurisdiction. Their critical role in the sanitation infrastructure and system in a city is a key factor in understanding any operation weaknesses.

Landuse is a general indicator of the city’s zoning and urban planning. Industrial zones, for example, are likely to have litter accumulation due to the nature of the activities. Residential zones, on the other hand, are less likely to have litter accumulation due to the opposite nature. The model uses landuse to potentially discover patterns in the city that may not necessarily be identified by intuition alone, but is still important is creating the picture of how an area is structured. With this generalization in mind, zoning alone is not enough for describing landuse. Roads also are a vaiable incorporated into the model to frame the network of the city, also acting as a proxy to density. Grid streets or segments closer together often represent denser, potentially taller buildings that represent activity, as well as vise-versa. Hence, each variable is a general feature known to be present in every city and allows for replicated use to any place in the world.

Feature Engineering

Understanding the Data

After the boundary box was created, the data was then projected to a fishnet grid. The fishnet grid was used to count the number of litter points in each cell in a format that can be compared to by each grid cell in the city.

Following data extrapolation, the litter files were overlaid on a 500x500 meter fishnet grid to create a uniformed data structure for each city. The fishnet grid was created using the ‘sf’ package and projected to the respective city’s UTM projection. The grid was used to aggregate and visualize the litter data and other variables to create a uniformed data structure for each city. The variables used in the model were selected based on their potential association with litter accumulation. These variables included land-use, proximity to water, waste site facilities, roads, and other indicators of commercial activity. The aim of these variables was to act as proxies for litter cases, providing insight on where litter most likely accumulated or ends up. Overall the goal of the data gathering and variables chosen provide a model an understanding the city itself, and in turn understand the systems in place and the environment assessment needed.

Data Preprocessing

Each aspect of the data gathering, through manual download, creation, or API, can now be coalesced into one cohesive visual. As mentioned with the 500x500 meter grid, all variables including litter as measuring based on an attribute regarding the cells. This grid is structure by the boundary box created from Google My Maps. This boundary file projects the city and defines its boundary on where the model is evaluating. Boundary files for our data are knowingly compatible with KML or GEOJSON files, or other file types that can be converted to raster layers.

For example, our litter data is not counted as individual data points for model development, but rather a sum of points within each cell. This allows for a better indication of frequency and density of litter in each area. The same method is applied to our dependent variables as well for uniformed formatted and consistency in structure. Below code & figure 3 demonstrate the example process of importing the litter CSV, boundary KML, and the creation of the overlay used for data exploration and analysis.

Our M. With this in mind,. We then repeated this process for the rest of the Urban Ocean cities. As calling large amount of OSM data is time consuming for each city, each fishnet for each city was saved and reloaded in this process.

litter <- read.csv('https://raw.githubusercontent.com/TrevorKap/MUSA810-Marine-Pollution/main/Data/mdt-dataChennai.csv')
litter_p <- litter%>%filter(master_material == 'PLASTIC')%>%
  st_as_sf(coords = c("longitude", "latitude"), crs = 4326, agr = "constant")%>%st_transform('EPSG:32643')

chen_bdry <- st_read('https://github.com/TrevorKap/MUSA810-Marine-Pollution/raw/main/Data/Chennai.kml')
chen_bdry <- st_set_crs(chen_bdry, 4326)%>%st_transform('EPSG:32643')
temp_bd <- st_read('https://github.com/TrevorKap/MUSA810-Marine-Pollution/raw/main/Data/Chennai.kml')

temp_bbox <- get_bbox(temp_bd) 
temp_fish <- create_fish(chen_bdry) 
final_net <- countfishnet(temp_fish, litter_p)
final_net <- pn_gen(stor_df)
final_net <- moran_gen(final_net,stor_df) 
chen_net <- final_net
city_data_list <- lapply(cities, load_city_data)
names(city_data_list) <- cities

bd_data_list <- lapply(cities, load_city_kml)
names(bd_data_list) <- bd

bd_data_meter_list <- lapply(cities, load_city_kml_meter)
names(bd_data_meter_list) <- bdM

for (i in seq_along(city_data_list)) {assign(cities[i], city_data_list[[i]])}

for (i in seq_along(bd_data_list)) {assign(bd[i], bd_data_list[[i]])}

for (i in seq_along(bd_data_meter_list)) {assign(bdM[i], bd_data_meter_list[[i]])}
base_path <- "https://raw.githubusercontent.com/TrevorKap/MUSA810-Marine-Pollution/main/Data/stored_city/"
city_names <- c("santa_fe", "semarang", "panama", "can_tho", "melaka", "salvador",
                "surat", "santiago", "bangkok", "chennai", "mumbai", "pune")
city_data <- list()

for (city in city_names) {
  file_path <- paste0(base_path, city, "_net.geojson")
  net_sf <- st_read(file_path)
  net_sf <- st_set_crs(net_sf, 32643)
  city_data[[city]] <- net_sf
}
list2env(setNames(city_data, paste0("net_", city_names)), envir = .GlobalEnv)

Data Wrangling

Given

library(grid)
cnt_wt <- function(litter,net){
  temp <- st_join(litter,net,join = st_within)
  temp <- temp %>% filter(!is.na(uniqueID))
  temp_sum <- temp %>%
    group_by(uniqueID)%>%
    summarise(count_un = n_distinct(username))
  net <- left_join(net,st_drop_geometry(temp_sum),by = 'uniqueID')
  net <- net %>%
    mutate(count_un = replace_na(count_un, 0),
         count = count/count_un,
         count = replace_na(count, 0))
  return(net)
}

net_mumbai <- cnt_wt(Mumbai,net_mumbai)
net_chennai <- cnt_wt(Chennai,net_chennai)
net_bangkok <- cnt_wt(Bangkok,net_bangkok)
net_can_tho <- cnt_wt(Can_Tho,net_can_tho)
net_melaka <- cnt_wt(Melaka,net_melaka)
net_panama <- cnt_wt(Panama_City,net_panama)
net_pune <- cnt_wt(Pune,net_pune)
net_salvador <- cnt_wt(Salvador,net_salvador)
net_santa_fe <- cnt_wt(Santa_Fe,net_santa_fe)
net_santiago <- cnt_wt(Santiago,net_santiago)
net_semarang <- cnt_wt(Semarang,net_semarang)
net_surat <- cnt_wt(Surat,net_surat)

#z-score-normalizing 

z_score_normalize <- function(x) {
  (x - mean(x)) / sd(x)
}

normal_city <- function(data){
  data <- data%>% 
    mutate(across(where(is.numeric) & -'count', z_score_normalize),
           across(where(is.numeric), ~replace(., is.nan(.), 0)))
  return(data)
}

net_salvador <- normal_city(net_salvador)
net_santa_fe <- normal_city(net_santa_fe)
net_santiago <- normal_city(net_santiago)
net_semarang <- normal_city(net_semarang)
net_surat <- normal_city(net_surat)
net_bangkok <- normal_city(net_bangkok)
net_can_tho <- normal_city(net_can_tho)
net_chennai <- normal_city(net_chennai)
net_melaka <- normal_city(net_melaka)
net_mumbai <- normal_city(net_mumbai)
net_panama <- normal_city(net_panama)
net_pune <- normal_city(net_pune)

# agg
net_bangkok <- net_bangkok %>% mutate(city = 'Bangkok',country = 'Thailand')
net_can_tho <- net_can_tho %>% mutate(city = 'Can_Tho',country = 'Vietnam')
net_chennai <- net_chennai %>% mutate(city = 'Chennai',country = 'India')
net_melaka <- net_melaka %>% mutate(city = 'Melaka',country = 'Malaysia')
net_mumbai <- net_mumbai %>% mutate(city = 'Mumbai',country = 'India')
net_panama <- net_panama %>% mutate(city = 'Panama_City',country = 'Panama')
net_pune <- net_pune %>% mutate(city = 'Pune',country = 'India')
net_salvador <- net_salvador %>% mutate(city = 'Salvador',country = 'Brazil')
net_santa_fe <- net_santa_fe %>% mutate(city = 'Santa_Fe',country = 'Argentina')
net_santiago <- net_santiago %>% mutate(city = 'Santiago',country = 'Chile')
net_semarang <- net_semarang %>% mutate(city = 'Semarang',country = 'Indonesia')
net_surat <- net_surat %>% mutate(city = 'Surat',country = 'India')

net_total <- rbind(net_bangkok,net_can_tho,net_chennai,net_melaka,net_mumbai,net_panama,net_pune,net_salvador,net_santa_fe,net_santiago,net_semarang,net_surat) %>% mutate(uniqueID = 1:n())

net_total_li <- net_total %>% dplyr::filter(count != 0)
net_total_li <- net_total_li %>%
  mutate(count = (count - mean(count)) / sd(count))

total_litter <- do.call(rbind, city_data_list)

grid.draw(
  grobTree(
    rectGrob(gp=gpar(fill="#ecf6ff", lwd=0, col = "#ecf6ff")),
    grid.arrange(
      ggplot(total_litter, aes(x = master_material, fill = master_material)) + 
        geom_bar() +
        plotTheme() +
        labs(title = "Count of Main Item Category",
             x = "Category",
             y = "Count") +
        theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
        scale_fill_manual(values = c("PLASTIC" = "#1c3aa9", "other" = "grey")) +
        guides(fill = FALSE),
      nrow = 1
    )
  )
)

Geospatial Factors

In addition to the dependent variables in their counted presence, we implemented a K-Nearest Neighbor analysis (also displayed) to determine the proximity of each data point to one another and incorporate that as an additional variable to our model. K-Nearest Neighbor (KNN) allows us to find the proximity of another variant to determine commercial/high activity spots in the area. The nearest neighbor factor was repeated for land-use, roads, restaurants, water, and waste facilities. We added KNN as a factor to place emphasis on density and location-based processing. This addition acts as a fail-safe for our model’s cell calculation given that examining count alone does not incorporate the real-world application of proximity.

Local Moran’s I measures spatial autocorrelation which measures the degree of spatial relationship between observations, measuring how closer values are clustered in a space. We measure these among the dependent and independent variables to factor in any spatial lags identified within the region. Similar to KNN in terms of purpose, this factor allows the model to understand the relationship between the dependent and independent variables and apply that relationship in its calculations for each cell.

Mapping

Following this feature engineering, the different variables were visualized in the fishnet grid below.

Litter

The visuals showing ‘count’ measure the number of parcels with the zone or number of buildings within the cells and does not account for density. The nearest neighbor maps (‘Variable’_nn) measured their proximity.

library(ggplot2)
library(stringr)
library(grid)
library(gridExtra)
map_waste <- visual_count(net_chennai, "waste")
map_waste_nn <- visual_count(net_chennai,'waste_nn')
grid.draw(grobTree(rectGrob(gp=gpar(fill="#ecf6ff", lwd=0, col = "#ecf6ff")), 
                   grid.arrange(map_waste, map_waste_nn, ncol = 2)))

### OSM Variables

The examples are, waste management facilities (waste), water (streams, ponds, canals, etc), restaurants, and zoning (industrial, residential, retail). The overall selection was based on our hypothesis of areas of human activity leading to higher litter risk. The aim of these variables is to act as proxies for litter cases, providing an indication on where litter most likely accumulated or ends up. Each variable is then placed on a combined fishnet grid with the litter data, and evaluated for association between the variables and litter.

visual_count1 <- function(net_one,variable){
  ggplot() +
    geom_sf(data = net_one, aes(fill = net_one[[variable]]), color = NA) +
    scale_fill_viridis_c(option = "mako", name = str_to_title(variable)) +
    #labs(title = paste(str_to_title(variable),"Count for the Fishnet")) +
    mapTheme()
}
map_water <- visual_count1(net_chennai,"water")
map_restaurant <- visual_count1(net_chennai,"restaurant")
map_road <- visual_count1(net_chennai,'road')
map_industrial <- visual_count1(net_chennai,'industrial')
map_residential <- visual_count1(net_chennai,"residential")
map_retail <- visual_count1(net_chennai,'retail')

grid.draw(
  grobTree(
    rectGrob(gp=gpar(fill="#ecf6ff", lwd=0, col = "#ecf6ff")),
      grid.arrange(map_water, map_restaurant, map_road,
                   map_industrial, map_residential, map_retail, nrow = 2)
    )
  )

map_water_nn <- visual_count1(net_chennai,'water_nn')
map_restaurant_nn <- visual_count1(net_chennai,"restaurant_nn")
map_road_nn <- visual_count1(net_chennai,'road_nn')
map_industrial_nn <- visual_count1(net_chennai,'industrial_nn')
map_residential_nn <- visual_count1(net_chennai,"residential_nn")
map_retail_nn <- visual_count1(net_chennai,'retail_nn')


grid.draw(
  grobTree(
    rectGrob(gp=gpar(fill="#ecf6ff", lwd=0, col = "#ecf6ff")),
      grid.arrange(map_water_nn, map_restaurant_nn, map_road_nn,
                   map_industrial_nn, map_residential_nn, map_retail_nn, nrow = 2)
    )
  )

Principal Component Analysis

The Principal Component Analysis (PCA) is a dimensional reduction technique used to transform high-dimensional data into a lower-dimensional space while preserving as much variance as possible. The PCA below is used to identify the most important variables in the dataset and their correlation with litter, ultimately creating synthetic variables that can augment the data input for the model. The first output is a table that depicts the general metrics found in the independent variables, such as standard deviation, variance, and the proportion of variance explained by each component.

corr <- st_drop_geometry(net_total) %>% dplyr::select(!c(uniqueID,cvID,city,country))
corr_nor<- scale(corr)
corr_matrix <- cor(corr_nor)
#ggcorrplot(corr_matrix)
data.pca <- princomp(corr_matrix)
pca_sum <- summary(data.pca)
data.pca$loadings[, 1:2]
##                          Comp.1      Comp.2
## count                0.03910654  0.02824418
## water                0.06939559 -0.16334301
## water_nn            -0.25294001  0.17990491
## waste                0.06006527 -0.06645713
## waste_nn            -0.28343700  0.02087865
## restaurant           0.12232184 -0.04222381
## restaurant_nn       -0.27714032 -0.08945392
## road                 0.16341393  0.41090861
## road_nn             -0.13750912 -0.31917998
## industrial           0.04557497 -0.23972791
## industrial_nn       -0.12543259  0.22518605
## residential          0.12280232 -0.03065005
## residential_nn      -0.16985709 -0.10984955
## retail               0.08343909 -0.10307191
## retail_nn           -0.27734971  0.02874484
## avg_pop              0.23916125  0.24137571
## sum_pop              0.25998533  0.29686304
## water_sig            0.09268665 -0.18719624
## water_sig_dis       -0.26523016  0.18214397
## waste_sig            0.10033340 -0.10307020
## waste_sig_dis       -0.25677004  0.09727076
## restaurant_sig       0.13120168 -0.05896123
## restaurant_sig_dis  -0.28062883 -0.01118523
## road_sig             0.05390253  0.23012861
## road_sig_dis        -0.09843262 -0.27218982
## industrial_sig       0.05587300 -0.26867581
## industrial_sig_dis  -0.15335016  0.23961425
## residential_sig      0.12014661 -0.10190999
## residential_sig_dis -0.25040796  0.03631134
## retail_sig           0.10677570 -0.11630429
## retail_sig_dis      -0.23770606  0.03603264
## count_un             0.05246863  0.02625072

The Eigenvalues are the amount of variance explained by each component, quantifying the importance of each component in the dataset. The first dimension explains the vast majority of the variance, with a sharp decrease thereafter. These results mean that we only need 1 dimension of overlap when evaluating our model and additional dimensions of evaluation will not be useful thereafter. A single dimensional view means having all the independent and dependent variables overlaid onto a single space provides the greatest insight, and any variations of such will not be useful.

fviz_eig(data.pca, addlabels = TRUE) + ggtitle("Eigenvalues from Principal Component Analysis") + plotTheme()

The squared cosine value (Cos2) shows the quality of the representation of the variables on the factor map. The closer the value is to 1, the better the representation of the variable. What we notice particularly about these variables with the largest quality of representation is that they are statistical factor rather than a numeric value. Almost each independent variable with a ~0.75 value is either a KNN or significance value. The only exception being sum_pop (population density) and avg_pop (average population).

fviz_cos2(data.pca, choice = "var", axes = 1:2)+ plotTheme()

The PCA Variables graph shows the correlation of each variable in comparison to the first two dimensions. The arrows that represent each variable have varying directions and lengths, indicating mostly little correlation between the first dimension and second dimension. But the few variables that have relatively longer lengths have stronger relationships between both dimensions and can be used for the model. To better understand their importance we evaluate their Squared Cosine. This graph serves as the basis for choosing the several variables. Based on the result, the selected variables included ‘waste_sig_dis, restaurant_sig_dis, residential_sig_dis, water_sig_dis, residential_nn, industrial_sig_dis, industrial_sig, restaurant_sig, industrial_nn, road_sig_dis, residential_sig, restaurant_sig, restaurant’ as the ‘shortened model’ independent variables. Each variable with a ’_sig’ suffix indicates where the presence of the variable, considering KNN and Moran’s I, have significant impact on where the dependent variable, litter, is located.

n_colors <- 100 
mako_gradient <- viridis(n_colors, option = "mako")
fviz_pca_var(data.pca, col.var = "cos2",
             gradient.cols = mako_gradient,
             repel = TRUE) + plotTheme()

Model Building

Our model goal is to predict the relative likelihood of litter accumulation in a given area based on the independent variables. The model will be built using the Random Forest and Linear Regression algorithms. . We take both of these models/results and compare them to determine a mixed-use approach that factors both results with equal weight.

Random Forest Model

The Random Forest model is a machine learning algorithm that uses an ensemble of decision trees to predict the likelihood of an event occurring, the event being the presence of trash. The following random forest model utilizes the ‘TidyModels’ package. The data is then split to a 75-25 train-test set. We specify the randomforest to function as a ‘regression’ to specify the type of output to be evaluated, that being the ‘count’ of litter in each cell. The data is then divided into 10 folds for cross-validation to measure accuracy among each iteration. The model is then tuned using a grid search to find the optimal hyperparameters at each level of different tree (decision) size, ranging between 500 - 2000 decisions. The best-performing model iteration is selected based on the Root-mean-square deviation metric, measuring the accuracy of the forecasted results of each decision. The mean error depicted in the table by each forest (iteration) lies around 0.844 - 0.847. The first result, Preprocessor1_Model10, has the best result with the lowest RMSE of 0.844. This model will be used for our final mixed-used approach.

set.seed(123)
net_tt_nor <- st_drop_geometry(net_total_li) %>%
  dplyr::select(-c(uniqueID,cvID,city,country))
net_tt_nor$city <- net_total_li$city
net_tt_nor$uniqueID <- net_total_li$uniqueID
net_tt_temp <- net_tt_nor%>%dplyr::select(!c(uniqueID,city))

net_tt_split <- initial_split(net_tt_temp, prop = 0.75)
train_data <- training(net_tt_split)
test_data <- testing(net_tt_split)

rf_model <- rand_forest(
  mode = "regression",
  trees = tune(),         
  min_n = tune(),
  mtry = tune(),
) %>% set_engine("ranger")

rf_recipe <- recipe(count ~ ., data = train_data)

cv_folds <- vfold_cv(train_data, v = 10)

rf_grid <- rf_grid <- grid_regular(
  trees(range = c(500, 2000)),   
  min_n(range = c(9, 15)),       
  mtry(range = c(6, 30)),        
  levels = c(4, 3, 5) 
)

rf_workflow <- workflow() %>%
  add_model(rf_model) %>%
  add_recipe(rf_recipe)

rf_results <- tune_grid(
  rf_workflow,
  resamples = cv_folds,
  grid = rf_grid
)

rf_results_df <- as.data.frame(show_best(rf_results, metric = "rmse"))

final_rf <- finalize_workflow(
  rf_workflow,
  select_best(rf_results, metric = "rmse")
)

final_rf <- last_fit(final_rf, split = net_tt_split )

#results <- collect_metrics(final_rf)
#print(results)

temp_ts <- predict(final_rf$.workflow[[1]], test_data) %>%
  rename(Prediction = .pred)
ts_bind <- cbind(test_data,temp_ts)

#best_results <- select_best(rf_results, metric = "rmse")
temp <- predict(final_rf$.workflow[[1]], net_total) %>%
  rename(Prediction = .pred)
temp_cat <- risk_level(temp,'kmeans')
df_rf_rst <- cbind(net_total,temp_cat)

Linear Model

The linear regression model, also built from ‘TidyModels’, takes a similar approach to predicting the count of litter in each cell. The Linear Regression model is a statistical model that uses a linear relationship between the dependent and independent variables to predict the continuous value of litter being present. The model is trained using the same train-test split as the Random Forest using the ‘lm’ method, known as a linear method. Next, it is trained using the ‘quasi’ family to account for the overdispersion of the data. Then cross-validated using 60 folds to measure the accuracy. The results are then shown in the map below in terms of a risk-level category. Level 1 being the lowest, assuming little-no litter risk, and 5 being the highest, our newly ‘selected’ site areas to direct zero-waste sites on. The map depicting Chennai, India shows a scatter of category-5 assessment areas depicting the far north and south of the city to be the primary areas of focus. These areas are also not where our litter data was recorded which shows the model’s ability to predict areas of high risk without prior data, exactly as intended.

set.seed(223)
train_control <- trainControl(method = "cv", number = 60)

model <- train(count ~ ., data = train_data, method = "lm",family = "quasi",trControl = train_control)

test_temp <- predict(model,net_total)
test_cbind <- cbind(net_total,test_temp)%>%
  rename(Prediction = test_temp)

temp_risk_cat <- st_drop_geometry(risk_level(test_cbind,'kmeans')) %>%dplyr::select(Risk_Category)
test_cbind <- cbind(test_cbind,temp_risk_cat)

Mixed Model Build

The Random Forest and Linear models are depicted side-by-side.

When creating our final results, as mentioned, both are given equal weight input. We are looking for where the two models coincide and contrast in their predictions for litter assessment. These two results then provide an area where to dedicate zero-waste site resources. The final map, depicting into the 5 categories, narrows down where to focus resources in both a small enough area to a general scope yet large enough to generalize a search area that does not attempt to depict individual cells. This balance between specificity and generality develops a definitive solution and action plan for Urban Ocean.

test_cb_rst <- st_drop_geometry(test_cbind) %>%
  dplyr::select(Prediction,Risk_Category)%>%
  rename(Pre_lr = Prediction,
         Risk_lr = Risk_Category)
rst_total <- cbind(df_rf_rst,test_cb_rst)%>%
  rename(Pre_rf = Prediction,
         Risk_rf = Risk_Category)
rst_total <- rst_total %>%
  mutate(Prediction = 0.4*Pre_rf + 0.6*Pre_lr)
rst_total <- risk_level(rst_total,'kmeans')
rst_total <- rst_total%>%
  mutate(Risk_Category = 6 - Risk_Category)
test_cbind <- test_cbind %>%
  mutate(Risk_Category = 6 - Risk_Category)
df_rf_rst <- df_rf_rst %>%
  mutate(Risk_Category = 6 - Risk_Category)

grid.draw(
  grobTree(
    rectGrob(gp=gpar(fill="#ecf6ff", lwd=0, col = "#ecf6ff")),
    grid.arrange(
      city_viz('Chennai', test_cbind, 'Linear Regression Model'),
      city_viz('Chennai', df_rf_rst, 'Random Forest Model'),
      nrow = 1
    )
  )
)

grid.draw(
  grobTree(
    rectGrob(gp=gpar(fill="#ecf6ff", lwd=0, col = "#ecf6ff")),
    grid.arrange(
      city_viz('Chennai', rst_total, 'Mixed Model')
    )
  )
)

Comparative Model and Error Analysis

Both Random Forest regressions and Linear Regressions have their benefits and drawbacks. The Random Forest model is known for its ability to handle large datasets and high dimensionality, but can be prone to overfitting and cannot handle ‘data noise’ as well as a Linear model. Random Forest models also tend to outperform linear regression in prediction accuracy when there are non-linear relationships between dependent and independent variables. We did not assume a linear or non-linear relationship between litter and the independent variables which also emphasizes the multi-modal approach. The Linear Regression model is known for its simplicity, but can be prone to underfitting and may not perform well if there is non-linear relationships among the variables. To evaluate the performance of both models, we compare the residuals by plotting them on a histogram to visualize the distribution and a scatter plot for the relationship.

The RandomForest scatterplot shows a weak relationship between the actual and predicted values, with an underestimation of the actual when the value exceed 2. These results tell us the model can indicate lower-risk areas better than higher-risk areas but have yet to identify a correlation between the independent and dependent variables. Meanwhile the Linear Regression residuals have a similar result but has slightly more curvature in the line of best fit (shown in green) which indicates mildly better performance when predicting places with higher actual values.

The histograms explain the same story differently. The Linear Model has residuals that over and underpredict the actual values, while the Random Forest consistently underpredicts the actual values. Now these results are where the mixed-model approach is incorporated. We must note that the data itself is already bias and skewed because of the inherent nature of litter, and the true verification of the model’s integrity is the real-world comparison to where it predicts.

Predicted vs Observed Residuals

We primarily conducted model comparison through our plotting of residuals. The following plots visualize the difference between the predicted and observed values for each of the three models.

library(grid)
rf_test <- ts_bind %>% dplyr::select(count,Prediction)
lr_rst <- predict(model,test_data)
lr_test <- cbind(test_data,lr_rst)%>% rename(Prediction = lr_rst) %>% dplyr::select(count,Prediction)
rf_test$Residuals <- rf_test$count - rf_test$Prediction
lr_test$Residuals <- lr_test$count - lr_test$Prediction

mixed_rst <- cbind(lr_test%>%dplyr::select(-Residuals),rf_test$Prediction) %>%
  rename(lr_pred = Prediction,
         rf_pred = 'rf_test$Prediction') %>%
  mutate(Prediction = 0.5*rf_pred + 0.5*lr_pred,
         Residuals = count - Prediction)

grid.draw(
  grobTree(
    rectGrob(gp=gpar(fill="#ecf6ff", lwd=0, col = "#ecf6ff")),
    grid.arrange(
      ggplot(lr_test, aes(x = count, y = Prediction)) +
        geom_point() +  
        geom_segment(aes(xend = count, yend = Prediction, x = count, y = count), 
                     linetype = "dotted", color = "#696969") + 
        geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "#1c3aa9") +
        labs(x = "Actual Value", y = "Predicted Value", title = "Linear Regression Model Residuals") +
        plotTheme(),
      ggplot(rf_test, aes(x = count, y = Prediction)) +
        geom_point() + 
        geom_segment(aes(xend = count, yend = Prediction, x = count, y = count), 
                     linetype = "dotted", color = "#696969") + 
        geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "#1c3aa9") +
        labs(x = "Actual Value", y = "Predicted Value", title = "Random Forest Model Residuals") +
        plotTheme(),
      nrow = 1
    )
  )
)

grid.draw(
  grobTree(
    rectGrob(gp=gpar(fill="#ecf6ff", lwd=0, col = "#ecf6ff")),
    grid.arrange(
      ggplot(mixed_rst, aes(x = count, y = Prediction)) +
        geom_point() + 
        geom_segment(aes(xend = count, yend = Prediction, x = count, y = count), 
                     linetype = "dotted", color = "#696969") + 
        geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "#1c3aa9") +
        labs(x = "Actual Value", y = "Predicted Value", title = 'Mixed Model Residuals') +
        plotTheme(),
      nrow = 1
    )
  )
)

The Multi-Vision Model features a better distribution than the previous two with a mild right skew still existing, similarly to the RandomForest. This indicates there is greater consistency in this model and, if tuned for accuracy as well, is our best performer in the environmental assessment of relative litter accumulation risk.

The next visual features all three models compared. Here we see the Multi-Vision (bottom) having the best bell-curve representation. The Linear Model (middle) has a less consistent, duel distribution and is not as ideal as the Multi-Vision. The Random Forest (top) has a more consistent distribution but is not as accurate as the Multi-Vision because of the greater skew. The Multi-Vision model is the best performer.

Now we examine the scatterplot of multi-vision residuals. It still shows an identical relationship as the previous two, which may not be as desired, but also does not tell the full story. The model is not perfect, but it is the best performer in the context of the data and the model’s purpose. The meaning of the model is better explained by the Q-Q plot. The Q-Q plot shows the residuals of the model in comparison to a normal distribution of the actual dataset. The closer the points are to the 45-degree line, the more similar the dataset is to the theoretical distribution, indicating greater consistency. The data points are not perfectly aligned but are very close for the most part, creating a mild curvature in the middle, indicating skewness in the data. However, the tail-end of the points of higher theoretical quantities far divert from the line of best fit. These outliers deviate at an exponential rate but only after 1.5 theoretical quantiles, with a more systemic deviation between 1 and 1.5 quantiles. Overall, the results indicate that, despite the imperfections implied with the bias litter data, the model performs well relative to the circumstances and capabilities.

Histogram of Residuals

grid.draw(
  grobTree(
    rectGrob(gp=gpar(fill="#ecf6ff", lwd=0, col = "#ecf6ff")),
    grid.arrange(
      ggplot(rf_test, aes(x = Residuals)) + 
        geom_histogram(bins = 30, fill = "grey") +
        geom_vline(xintercept = 0, linetype="dashed", color = "black") +
        labs(title = "Random Forest Model Residuals") + 
        xlab("Values") + 
        ylab("Frequency") + 
        plotTheme(),
      
      ggplot(lr_test, aes(x = Residuals)) + 
        geom_histogram(bins = 30, fill = "grey") +
        geom_vline(xintercept = 0, linetype="dashed", color = "black") +
        labs(title = "Linear Regression Model Residuals") + 
        xlab("Values") + 
        ylab("Frequency") + 
        plotTheme(),
      
      ggplot(mixed_rst, aes(x = Residuals)) + 
        geom_histogram(bins = 30, fill = "#1c3aa9") +
        geom_vline(xintercept = 0, linetype="dashed", color = "black") +
        labs(title = "Mixed Model Residuals") + 
        xlab("Values") + 
        ylab("Frequency") + 
        plotTheme(), 
      
      nrow = 3
    )
  )
)

grid.draw(
  grobTree(
    rectGrob(gp=gpar(fill="#ecf6ff", lwd=0, col = "#ecf6ff")),
    grid.arrange(
      ggplot(mixed_rst, aes(sample = Residuals)) +
  stat_qq() +
  stat_qq_line(colour = "#1c3aa9") +
  ggtitle("Normal Q-Q Plot") +
  xlab("Theoretical Quantiles") +
  ylab("Sample Quantiles") +
  plotTheme(),
  nrow = 1
    )
  )
)

Model Results

As Chennai, India being our example city for our model, the purpose of the demonstration is to also prove replicability among other cities. As a result, we have applied the model to 11 other cities around the world. The results are shown below. Each city has its own unique characteristics and features that are evaluated with the same variables. The results are shown in the 5-category risk levels and are consistent with the Chennai example. Each city has its own special results that cater to the local context. This newly found insight shall narrow the area of focus for Urban Ocean’s zero-waste sites, empower the local community, and provide a more sustainable future.

risk_v_new <-function(model_data,litter_data,model,city){
  model_data$Risk_Category <- as.factor(model_data$Risk_Category)
  ggplot() +
    geom_sf(data = model_data, aes(fill = Risk_Category), colour = NA) +
    geom_sf(data = litter_data, size = .3, colour = "#1c3aa9") +
    scale_fill_manual(values = risk_palette, guide = "none") +
    labs(title=model) +
    mapTheme(title_size = 8)
}
city_viz_new <- function(cities,data,model){
  temp <- data %>% filter(city == cities)
  risk_v_new(temp,get(cities),model,cities)
}

risk_Chennai <- city_viz_new('Chennai', rst_total, 'Chennai, India')
risk_Bangkok <- city_viz_new('Bangkok', rst_total, 'Bangkok, Thailand')
risk_Can_Tho <- city_viz_new('Can_Tho', rst_total, 'Can Tho, Vietnam')
risk_Melaka <- city_viz_new('Melaka', rst_total, 'Melaka, Malaysia')
risk_Mumbai <- city_viz_new('Mumbai', rst_total, 'Mumbai, India')
risk_Panama_City <- city_viz_new('Panama_City', rst_total, 'Panama City, Panama')
risk_Pune <- city_viz_new('Pune', rst_total, 'Pune')
risk_Salvador <- city_viz_new('Salvador', rst_total, 'Salvador, Brazil')
risk_Santa_Fe <- city_viz_new('Santa_Fe', rst_total, 'Santa Fe, Argentina')
risk_Santiago <- city_viz_new('Santiago', rst_total, 'Santiago, Chile')
risk_Semarang <- city_viz_new('Semarang', rst_total, 'Semarang, Indonesia')
risk_Surat <- city_viz_new('Surat', rst_total, 'Surat, India')



grid.draw(
  grobTree(
    rectGrob(gp = gpar(fill = "#ecf6ff", lwd = 0, col = "#ecf6ff")),
    grid.arrange(
      risk_Chennai, risk_Bangkok, risk_Can_Tho, risk_Melaka,
      risk_Mumbai, risk_Panama_City, risk_Pune, risk_Salvador,
      risk_Santa_Fe, risk_Santiago, risk_Semarang, risk_Surat,
      ncol = 4
    )
  )
)

Web-Based Dashboard

Link to our dashboard

Our web-based dashboard, entitled Marine Litter Assessment Dashboard, visualizes our model results and other data analysis in an interactive way.

WILL REPLACE)
WILL REPLACE)

Once we have obtained the prediction results from our machine learning model, we input them into a web application. At the core of this application is an interactive map that allows users to view the probability of each county’s region transitioning from permeable to impermeable in the future by selecting various layers. The probability interval is divided into five equal parts based on its highest value, with the high probability change areas highlighted in red. Users can further click on the “check layer” to view specific information about these high probability conversion areas, including their past land cover types and the specific numerical values of their probability of conversion. The high-risk areas are marked in red, making them easier to locate quickly. Users can also select other relevant information from a dropdown list, such as complete land cover maps from 2014/2018 and census tract data. This information can assist decision-makers in balancing their decisions on whether to protect certain areas.

Conclusion

Sun Metro Optimizer aims to use our data-driven, politically unbiased, and equity and revenue optimizing model to understand improved bus transit route allocation. With accurate forecasts of ridership trends, Sun Metro transit planners can see areas that may be underserved by transit, and ensure that whilst they reach these regions, they optimize routes for ridership and equity.